The urban forest is more important than you think.

We all know some of the benefits of trees, like absorbing CO2 from the air and adding beauty to a neighborhood.

But there is more to it than you might think:

  • Carbon sequestration/storage
  • Promoting diverse flora
  • Barrier to noisy traffic
  • Each large front-yard tree adds about 1% to the sales price of the property
  • Trees reduce stormwater runoff by capturing and storing rainfall in their canopy and promote the infiltration of rainwater into the soil

But what exactly goes into sustaining the life of these trees, and what are the risks of not taking it seriously? This is where urban forest monitoring, prevention, and mitigation come in: increasing public safety by examining the condition of trees non-destructively to develop a plan of action, preventing sidewalk damage, monitoring the invasive Norway maple that crowds out other tree species and habitat, watching ash trees (Fraxinus spp.) in preparation for an invasion of the emerald ash borer, and allocating budget for renewal and pruning.

This project aims to help identify and predict possible risks and areas needing attention.

For this project I'll be working with a few datasets:

  • New York City tree inventory data from NYC Open Data
  • Urban street tree inventory data for Newburgh, New York (2015)
  • Historic Tree Inventory - 2018/2019 for the City of Buffalo

Since the NYC tree inventory has more features and observations, it'll serve as the training dataset. I'll test predictions on the other two.

Feature Engineering

The dataset doesn't have many useful features to work with, but there's some room for feature engineering, like combining address and street name to find the longitude and latitude of each tree using geopy, then plotting them on a map with plotly.

We can also use a formula to estimate tree age by species, creating even more features.

For visualization:

  • a year slider connected to growing risk-factor areas with high tree pollen
  • uploading a dataset from a different city

With a high emphasis on the urban forest, it's important to keep it healthy and hazard-free.

To find out which trees correspond to what degree of allergy severity, I used this website. Due to lack of time I have not scraped the data (which I should do in the future); instead, I copied it manually into a spreadsheet. The plan is to merge the dataframes on botanical name.
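The merge itself can be sketched as follows; the tiny frames and values here are hypothetical stand-ins for the real tables, with `allergy` as an assumed severity score:

```python
import pandas as pd

# Hypothetical mini versions of the two tables, for illustration only
trees = pd.DataFrame({'botanical_name': ['acer platanoides', 'quercus palustris'],
                      'dbh': [10, 26]})
allergy = pd.DataFrame({'botanical_name': ['acer platanoides', 'quercus palustris'],
                        'allergy': [9, 7]})

# Left join keeps every tree, even species missing from the allergy table
merged = trees.merge(allergy, on='botanical_name', how='left')
print(merged)
```

A left join is the safer choice here: species absent from the allergy spreadsheet simply get a missing severity instead of dropping out of the inventory.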

After training a model to predict risk rating, it would be possible to upload tree inventory data from another city (provided it's formatted accordingly) to see a forecast of the trees that need attention.

Data Wrangling

The NYC tree inventory contains close to a million observations and is updated regularly. With a dataset this large, we needn't worry about dropping missing values. Wrangling will be important so the dataset contains only the necessary variables and stays memory efficient.

import pandas as pd

df_ny_trees=pd.read_csv('Forestry_Tree_Points.csv')
df_ny_trees.columns=df_ny_trees.columns.str.lower()

# drop identifiers and dates that won't help the model
df_ny_trees=df_ny_trees.drop(['objectid','plantingspaceglobalid',
                              'geometry','globalid','riskratingdate',
                              'planteddate','createddate','updateddate','stumpdiameter'],axis=1)
df_ny_trees=df_ny_trees.dropna()
df_ny_trees=df_ny_trees.rename(columns={'genusspecies':'botanical_name'})

# normalize the text columns
df_ny_trees['botanical_name']=df_ny_trees['botanical_name'].str.lower().str.split('-').str[0]
df_ny_trees['tpcondition']=df_ny_trees['tpcondition'].str.lower()
df_ny_trees['tpstructure']=df_ny_trees['tpstructure'].str.lower()
df_ny_trees['dbh']=df_ny_trees['dbh'].astype(int)
df_ny_trees['riskrating']=df_ny_trees['riskrating'].astype(int)
df_ny_trees.drop(df_ny_trees[df_ny_trees['tpcondition']=='unknown'].index, inplace=True)

# parse the "(lat, lon)" strings in the location column
df_ny_trees['latitude']=(df_ny_trees['location']
                         .apply(lambda x: x.split('(')[-1].strip(')').split(','))
                         .apply(lambda x:x[0])).astype(float)
df_ny_trees['longitude']=(df_ny_trees['location']
                          .apply(lambda x: x.split('(')[-1].strip(')').split(','))
                          .apply(lambda x:x[1])).astype(float)
df_ny_trees=df_ny_trees.drop('location',axis=1)

# trim cultivar names down to genus and species
df_ny_trees.loc[df_ny_trees['botanical_name']=="prunus serrulata 'green leaf' ",'botanical_name']='prunus serrulata'
df_ny_trees['botanical_name']=df_ny_trees['botanical_name'].str.extract('(^[,. A-Za-z]*[A-Za-z])')

# cap diameters at 40 inches and drop rare species (fewer than 50 observations)
df_ny_trees=df_ny_trees[~(df_ny_trees['dbh']>40)]
drop_rows=df_ny_trees['botanical_name'].value_counts()[(df_ny_trees['botanical_name'].value_counts()<50)].index
df_ny_trees=df_ny_trees[~df_ny_trees['botanical_name'].isin(drop_rows)]
df_ny_trees.shape
(297782, 7)
df_ny_trees.head()
dbh tpstructure tpcondition botanical_name riskrating latitude longitude
0 26 full good quercus palustris 7 40.863350 -73.906594
2 30 full fair quercus palustris 8 40.710290 -73.833408
4 5 full good quercus phellos 3 40.727396 -74.007550
5 10 full fair acer platanoides 8 40.813687 -73.943579
6 36 full fair fraxinus americana 6 40.856531 -73.790451

After wrangling, the dataset is reduced to around 300,000 observations. For ease of use, I'll save it to a separate CSV and load the data from there.
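When reloading, memory can be kept down further by reading the text columns back as categoricals; a minimal sketch (the tiny stand-in CSV below is only for illustration):

```python
import pandas as pd

# A tiny stand-in for the wrangled CSV, for illustration only
pd.DataFrame({'tpcondition': ['good', 'fair'],
              'botanical_name': ['quercus palustris', 'acer platanoides'],
              'dbh': [26, 10]}).to_csv('df_ny_trees_wrangled.csv', index=False)

# Categorical dtypes store each unique string once, shrinking memory use
df = pd.read_csv('df_ny_trees_wrangled.csv',
                 dtype={'tpcondition': 'category',
                        'botanical_name': 'category'})
print(df.memory_usage(deep=True).sum())
```

With only a handful of distinct species and condition labels across 300,000 rows, the savings from `category` dtypes can be substantial.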

df_ny_trees
dbh tpstructure tpcondition botanical_name riskrating latitude longitude
0 26 full good quercus palustris 7 40.863350 -73.906594
2 30 full fair quercus palustris 8 40.710290 -73.833408
4 5 full good quercus phellos 3 40.727396 -74.007550
5 10 full fair acer platanoides 8 40.813687 -73.943579
6 36 full fair fraxinus americana 6 40.856531 -73.790451
... ... ... ... ... ... ... ...
978278 11 full fair tilia cordata 6 40.703732 -73.957353
978279 13 full good pyrus calleryana 6 40.882492 -73.859594
978280 6 full good zelkova serrata 3 40.798664 -73.963754
978282 23 full good quercus palustris 6 40.870463 -73.870709
978283 22 full fair platanus x acerifolia 8 40.671471 -73.879124

299458 rows × 7 columns

df_ny_trees.to_csv('df_ny_trees_wrangled.csv',index=False)

df_ny_trees_wrangled=pd.read_csv('df_ny_trees_wrangled.csv')
df_ny_trees_wrangled.groupby(['tpcondition','dbh'])['riskrating'].value_counts().sort_values(ascending=False).head(30)
tpcondition  dbh  riskrating
good         3    3             5589
             4    3             5397
             5    3             4962
             6    3             4157
             7    3             3150
             8    3             2804
             9    3             2059
             10   3             2044
             2    3             1881
             11   3             1806
             12   3             1680
             8    6             1475
             6    5             1409
             13   3             1387
             7    6             1379
             12   6             1365
             5    5             1364
             10   6             1352
             6    6             1348
             11   6             1342
fair         11   6             1332
good         8    5             1318
             9    6             1306
fair         14   7             1302
good         14   6             1299
fair         12   6             1273
good         14   3             1270
fair         14   6             1262
             13   6             1260
good         13   6             1253
Name: riskrating, dtype: int64

Buffalo Dataset

The idea is to make test datasets with the same features as the training set. Once they are unified, it should be easy to upload them and discover insights.
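One way to unify a test city's data with the training schema is to reindex its columns against the training feature list; a sketch with hypothetical frames (the column names mirror the wrangled NYC data, the rest is made up):

```python
import pandas as pd

# Feature columns the model was trained on
train_features = ['dbh', 'tpstructure', 'tpcondition', 'botanical_name']

# A test city missing some training columns and carrying extras
df_city = pd.DataFrame({'dbh': [6, 10],
                        'botanical_name': ['quercus palustris', 'pinus strobus'],
                        'street': ['farrington st', 'water st']})

# Align to the training schema: extras are dropped, missing columns become NaN
df_city_aligned = df_city.reindex(columns=train_features)
print(df_city_aligned.columns.tolist())
```

Columns the test city never measured (here `tpstructure` and `tpcondition`) come back as NaN, which an imputer in the pipeline can then fill.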

df_buffalo=pd.read_csv('Historic_Tree_Inventory_-_2018_2019.csv')
df_buffalo.columns=df_buffalo.columns.str.lower()

# drop the eco-benefit dollar estimates and location details we won't use
df_buffalo=df_buffalo.drop(columns=[
        'editing','total yearly eco benefits ($)', 'stormwater benefits ($)',
       'stormwater gallons saved', 'greenhouse co2 benefits ($)',
       'co2 avoided (in lbs.)', 'co2 sequestered (in lbs.)',
       'energy benefits ($)', 'kwh saved', 'therms saved',
       'air quality benefits ($)', 'pollutants saved (in lbs.)',
       'property benefits ($)','address','leaf surface area (in sq. ft.)',
       'street', 'side', 'site', 'council district', 'park name', 'site id', 'location'])
df_buffalo['common name']=df_buffalo['common name'].str.lower()
df_buffalo['botanical name']=df_buffalo['botanical name'].str.lower()
df_buffalo=df_buffalo.dropna()
df_buffalo['dbh']=df_buffalo['dbh'].astype(int)
# trim botanical names down to genus and species
df_buffalo['botanical name']=df_buffalo['botanical name'].str.extract('(^[,. A-Za-z]*[A-Za-z])')
df_buffalo['dbh']
0          0
1          0
2          0
3          0
4          0
          ..
132491     4
132492    22
132494     7
132495     8
132496    22
Name: dbh, Length: 132471, dtype: int32

Allergy dataset

This is a bonus dataset to be merged with our datasets to visualize the pollen allergy risk posed by the trees.

allergy=pd.read_csv('pollen.csv',header=None,names=['trees','allergy'])
allergy=(allergy
         .drop_duplicates()
         .reset_index(drop=True)
         .fillna(0))
allergy['allergy']=allergy['allergy'].astype(int)
allergy['trees']=allergy['trees'].str.lower()
# the botanical name sits in parentheses at the end of each entry
allergy['botanical_name']=allergy['trees'].apply(lambda x: x.split('(')[-1].strip(')'))
allergy=allergy.drop(columns='trees')
allergy=allergy[['botanical_name','allergy']]


Next up is the Newburgh dataset.

Newburgh dataset

df_newburgh=pd.read_csv('Z1535_3079_DOVK2R.csv')
df_newburgh.columns=df_newburgh.columns.str.lower()
# the botanical name sits in parentheses at the end of the species entry
df_newburgh['botanical_name']=(df_newburgh['species']
                  .apply(lambda x: x.split('(')[-1].strip(')'))
                  .str.lower())
df_newburgh['species']=df_newburgh['species'].str.extract('([, A-Za-z]*(?![^(]*\)))')
df_newburgh=df_newburgh.drop(['suffix','cultivar'],axis=1)
df_newburgh=df_newburgh.drop(columns=['side', 'site', 'on_street', 'inventory_date','site_id','area', 'stems','species'])
df_newburgh.loc[df_newburgh['botanical_name'].str.contains('vacant'),'botanical_name']='vacant'
# values of 40+ look like circumference rather than diameter; divide by pi to recover dbh
df_newburgh.loc[df_newburgh['dbh']>=40,'dbh']=(df_newburgh.loc[df_newburgh['dbh']>=40,'dbh']/3.14).round().astype(int)
df_newburgh['street']=df_newburgh['street'].str.lower()
df_newburgh.shape
(8037, 4)
df_newburgh
address street dbh botanical_name
0 251 farrington st 6 quercus palustris
1 251 farrington st 0 vacant
2 251 farrington st 0 vacant
3 251 farrington st 10 pinus strobus
4 251 farrington st 6 pinus strobus
... ... ... ... ...
8032 81 water st 9 fraxinus pennsylvanica
8033 94 water st 15 fraxinus pennsylvanica
8034 94 water st 15 fraxinus pennsylvanica
8035 94 water st 13 fraxinus pennsylvanica
8036 81 water st 11 acer rubrum

8037 rows × 4 columns

# computed before the 'species' column was dropped above
df_newburgh['species'].value_counts(normalize=True).head()*100
vacant                     42.055493
maple, Norway              14.781635
stump                       4.777902
pear, Callery               4.479283
honeylocust, thornless      3.794948
Name: species, dtype: float64

As we can see, the most common tree is the Norway maple. After some research, it turns out to be considered invasive.

Here are a few concerns:

Norway maple (Acer platanoides) is a large deciduous tree that can grow up to approximately 40-60 feet in height. They are tolerant of many different growing environments and have been a popular tree to plant on lawns and along streets because of their hardiness. Norway maples have very shallow roots and produce a great deal of shade which makes it difficult for grass and other plants to grow in the understory below. In urban environments, the root systems also destroy pavement, requiring expensive repairs. Other species of flora and fauna, such as insects and birds, may indirectly be affected due to the change in resource diversity and availability. Additionally, they are prolific seed producers and are now invading forests and forest edges.

Collecting the data was the most time-intensive task, since I'm not a domain expert and had to learn fast, and the datasets don't offer many features. A lot of research had to be done: detective work.

Next, I found a tree growth coefficients table with formulas to calculate tree age, height, and more.
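The age estimate can be sketched with the usual DBH-times-growth-factor rule of thumb; the factors below are illustrative assumptions, not values from the actual table:

```python
# Illustrative growth factors (years of age per inch of DBH); real tables vary
GROWTH_FACTOR = {
    'acer platanoides': 4.5,   # Norway maple (assumed value)
    'quercus palustris': 3.0,  # pin oak (assumed value)
}

def estimate_age(botanical_name, dbh):
    """Approximate tree age as DBH (inches) times the species growth factor."""
    factor = GROWTH_FACTOR.get(botanical_name)
    if factor is None:
        return None  # unknown species: leave the feature missing
    return round(dbh * factor)

print(estimate_age('quercus palustris', 26))
```

Applied per species over the inventory, this yields a rough `age` feature; species not covered by the table stay missing and can be handled by the imputer later.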

df_newburgh['full_Address']=(df_newburgh['address']
                             .astype(str)+' '+df_newburgh['street']
                             .astype(str)+', newburgh, ny')
df_newburgh
address street dbh botanical_name full_Address
0 251 farrington st 6 quercus palustris 251 farrington st, newburgh, ny
1 251 farrington st 0 vacant 251 farrington st, newburgh, ny
2 251 farrington st 0 vacant 251 farrington st, newburgh, ny
3 251 farrington st 10 pinus strobus 251 farrington st, newburgh, ny
4 251 farrington st 6 pinus strobus 251 farrington st, newburgh, ny
... ... ... ... ... ...
8032 81 water st 9 fraxinus pennsylvanica 81 water st, newburgh, ny
8033 94 water st 15 fraxinus pennsylvanica 94 water st, newburgh, ny
8034 94 water st 15 fraxinus pennsylvanica 94 water st, newburgh, ny
8035 94 water st 13 fraxinus pennsylvanica 94 water st, newburgh, ny
8036 81 water st 11 acer rubrum 81 water st, newburgh, ny

8037 rows × 5 columns

Getting longitude and latitude with Geopy

To find the location of each tree, I'm combining address and street into a new column.

Having a full address, I'm using geopy to obtain longitude and latitude data.

Depending on the dataset, it'll take a while. Best suited as an overnight task.

from geopy.geocoders import Nominatim
from geopy.extra.rate_limiter import RateLimiter

locator = Nominatim(user_agent='myGeocoder')
# delay between geocoding calls to respect Nominatim's rate limit
geocode = RateLimiter(locator.geocode, min_delay_seconds=1)
# geocode each full address into a Location object
df_newburgh['location'] = df_newburgh['full_Address'].apply(geocode)
# extract a (latitude, longitude, altitude) tuple from each location
df_newburgh['point'] = df_newburgh['location'].apply(lambda loc: tuple(loc.point) if loc else None)
# split the point tuple into latitude, longitude and altitude columns
df_newburgh[['latitude', 'longitude', 'altitude']] = pd.DataFrame(df_newburgh['point'].tolist(), index=df_newburgh.index)
 
df_newburgh.to_csv('newburgh trees.csv',index=False)
df_newburgh=pd.read_csv('newburgh trees.csv')
df_newburgh=df_newburgh.drop(columns=['address','street','full_Address','location','point','altitude'])
df_newburgh
dbh botanical_name latitude longitude
0 6 quercus palustris 41.504998 -74.012407
1 0 vacant 41.504998 -74.012407
2 0 vacant 41.504998 -74.012407
3 10 pinus strobus 41.504998 -74.012407
4 6 pinus strobus 41.504998 -74.012407
... ... ... ... ...
8032 9 fraxinus pennsylvanica 41.500394 -74.006753
8033 15 fraxinus pennsylvanica 41.498959 -74.007473
8034 15 fraxinus pennsylvanica 41.498959 -74.007473
8035 13 fraxinus pennsylvanica 41.498959 -74.007473
8036 11 acer rubrum 41.500394 -74.006753

8037 rows × 4 columns

Modeling

X=df_ny_trees_wrangled.drop(columns=['riskrating','latitude', 'longitude'])
y=df_ny_trees_wrangled['riskrating']
X
dbh tpstructure tpcondition botanical_name
0 26 full good quercus palustris
1 30 full fair quercus palustris
2 5 full good quercus phellos
3 10 full fair acer platanoides
4 36 full fair fraxinus americana
... ... ... ... ...
297777 11 full fair tilia cordata
297778 13 full good pyrus calleryana
297779 6 full good zelkova serrata
297780 23 full good quercus palustris
297781 22 full fair platanus x acerifolia

297782 rows × 4 columns

baseline=y.value_counts(normalize=True).max()*100
y.value_counts()
3     62191
6     59714
7     48359
5     37836
8     34885
9     22278
4     18951
10     9329
11     2823
12     1319
2        51
1        27
0        19
Name: riskrating, dtype: int64
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
# OrdinalEncoder here is assumed to come from the category_encoders package
from category_encoders import OrdinalEncoder

X_train,X_test,y_train,y_test=train_test_split(X,y,random_state=43,test_size=.2)
model=make_pipeline(OrdinalEncoder(),
                   SimpleImputer(),
                   RandomForestClassifier(n_estimators=2000,
#                                          max_depth=10,
                                         n_jobs=-1))

model.fit(X_train, y_train)

model.score(X_train,y_train),model.score(X_test,y_test)
(0.403584846258789, 0.3259398559363299)
from xgboost import XGBRFClassifier

model=make_pipeline(OrdinalEncoder(),
                   SimpleImputer(),
                   XGBRFClassifier(n_estimators=2000,
#                                          max_depth=10,
                                         n_jobs=-1))

model.fit(X_train, y_train)

model.score(X_train,y_train),model.score(X_test,y_test)
(0.3368202329730297, 0.33482210319525835)
model.predict(X_test)
array([7, 6, 6, ..., 9, 6, 3], dtype=int64)
model.predict_proba(X_test)
array([[3.68813159e-07, 5.07330653e-06, 1.58596564e-04, ...,
        3.33908219e-02, 6.91120536e-03, 2.91432095e-03],
       [1.10042314e-05, 3.12304809e-06, 1.62920860e-05, ...,
        3.65635308e-02, 7.49277362e-03, 6.02827178e-03],
       [1.01586822e-04, 3.15933337e-05, 2.08352290e-04, ...,
        8.53981764e-03, 2.26245610e-03, 9.88730780e-04],
       ...,
       [1.84226814e-04, 0.00000000e+00, 0.00000000e+00, ...,
        1.88155941e-01, 4.35123458e-02, 1.52336164e-02],
       [0.00000000e+00, 8.18169717e-06, 1.13826340e-05, ...,
        2.23799042e-02, 3.63439880e-03, 1.99009936e-03],
       [5.26315789e-07, 3.11884762e-05, 2.09675508e-04, ...,
        1.20994063e-03, 5.12201791e-04, 3.74612860e-04]])
import plotly.express as px

# plot each tree on a map, sized and colored by trunk diameter
fig = px.scatter_mapbox(df_newburgh,
                        lat='latitude',
                        lon='longitude',
                        size='dbh',
                        color='dbh',
                        opacity=.1,
                        zoom=10,
                        height=900)
fig.update_layout(mapbox_style="open-street-map")
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()

Limitations/Challenges

The main challenge was getting insight from data outside my expertise, which limited the questions I could ask and answer. Although there are ways to make the data work by engineering features, the hard part was staying on course and not overcomplicating things without need.

  • The trees measured were located on public property only.
  • Limited measured features. Given a table of equations and a table of coefficients, some features can be engineered, but they still won't reflect reality.